Port Qwen2.5-VL Model #2574

Open
jaytiwarihub wants to merge 12 commits into keras-team:master from jaytiwarihub:feat/qwen2-vl

Conversation

@jaytiwarihub (Contributor) commented Feb 4, 2026

PR Title

[Model] Add Qwen2-VL Model Architecture and Preprocessing

PR Description

What does this PR do?
This PR implements the Qwen2-VL (Qwen2 Vision-Language) model architecture in Keras 3. Qwen2-VL is a state-of-the-art multimodal model that introduces "Naive Dynamic Resolution" support, allowing it to process images of arbitrary aspect ratios by converting them into dynamic grids.

Key Components Implemented:

  • Qwen2VLVisionEncoder: A 3D Vision Transformer backbone that supports 3D convolution patch embeddings and 3D Rotary Positional Embeddings (RoPE) to handle video and dynamic image inputs.
  • Qwen2VLImageConverter: A preprocessing layer that implements the "Smart Resizing" logic, resizing images to optimal grid sizes based on the patch size (14x14) to minimize padding.
  • Qwen2VLProjector: A lightweight MLP adapter that projects visual features into the LLM's embedding space.
  • Qwen2VLCausalLM: The end-to-end model class connecting the vision tower with the Qwen2 text backbone.
  • Qwen2VLCausalLMPreprocessor: The high-level preprocessor handling text tokenization and image tensor conversion.
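To make the "Smart Resizing" bullet concrete, here is a hedged sketch of how such resizing is typically computed. The `factor` of 28 (patch size 14 × a 2x merge step) and the pixel budgets are assumptions modeled on the Qwen2-VL reference implementation, not values read from this PR:

```python
import math

def smart_resize(height, width, factor=28,
                 min_pixels=56 * 56, max_pixels=14 * 14 * 4 * 1280):
    # Round each side to the nearest multiple of `factor` so the image
    # tiles cleanly into patches with no padding.
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    # If the total pixel count falls outside the budget, rescale while
    # preserving the aspect ratio, then snap back to the grid.
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar
```

For example, a 1080x1920 frame would be snapped to grid-aligned dimensions within the pixel budget rather than padded to a fixed square.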

Technical Details:

  • 3D RoPE: Implemented custom rotary embeddings that account for Time, Height, and Width dimensions relative to the grid structure.
  • Dynamic Resolution: The vision encoder accepts inputs with variable spatial dimensions (processed via the image converter), enabling the model to "scan" images in their native aspect ratios.
  • MHA Compatibility: The attention mechanism creates standard Query/Key/Value splits compatible with Keras MultiHeadAttention.
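As a rough illustration of the 3D RoPE bullet (a sketch of the indexing only, not the PR's actual code): each patch in a (time, height, width) grid gets a coordinate triple, and the rotary embedding rotates separate chunks of the head dimension by each coordinate.

```python
import numpy as np

def rope_3d_position_ids(grid_t, grid_h, grid_w):
    # Enumerate patches in row-major (T, H, W) order and record each
    # patch's (time, height, width) coordinate. A 3D RoPE would rotate
    # one slice of the head dim by each of the three coordinate streams.
    t = np.arange(grid_t).repeat(grid_h * grid_w)
    h = np.tile(np.arange(grid_h).repeat(grid_w), grid_t)
    w = np.tile(np.arange(grid_w), grid_t * grid_h)
    return np.stack([t, h, w], axis=0)  # shape (3, T*H*W)
```

A static image is simply the grid_t=1 case, which is how one encoder can serve both images and video.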

Tests:

  • Added unit tests for all components (backbone_test, projector_test, image_converter_test).
  • Added an integration test (integration_test.py) verifying the end-to-end flow from raw text/image input to preprocessed tensors.
  • Verified shape correctness for both 3D (static image) and 5D (video) input tensors.

Reference:

@gemini-code-assist (Contributor) commented:

Summary of Changes

Hello @jaytiwarihub, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request lays the groundwork for the Qwen2.5-VL model, integrating vision processing capabilities into the existing Qwen language model. It includes the core components for encoding visual information, projecting it into the text embedding space, and fusing it with text embeddings for multimodal processing. The changes also include modifications to the preprocessor to handle text-only inputs gracefully.

Highlights

  • Qwen2.5-VL Architecture Implementation: This pull request introduces the initial skeleton for the Qwen2.5-VL architecture, including the vision encoder, projector, and backbone.
  • Vision Encoder (ViT Structure): The Qwen2VLVisionEncoder is defined, implementing a Vision Transformer (ViT) structure for processing image inputs.
  • Vision to Text Projection: The Qwen2VLProjector is defined to downsample vision features and project them into the text embedding space, facilitating fusion with text embeddings.
  • Backbone Wiring: The Qwen2VLBackbone connects the vision processing components to the existing Qwen text backbone, enabling multimodal processing.
  • Image Handling: The preprocessor is modified to handle None images, skipping vision layers when only text input is provided.

Changelog
  • keras_hub/src/models/gemma3/gemma3_causal_lm_preprocessor.py
    • Modified to handle None images by skipping vision layers when only text input is provided.
    • Removed unused responses parameter from generate_preprocess function.
    • Removed unnecessary calculations when images is None
  • keras_hub/src/models/qwen2_vl/qwen2_vl_backbone.py
    • Added Qwen2VLBackbone class to wire vision components to the existing Qwen text backbone.
    • Implements the forward pass for processing images and text, fusing their embeddings, and passing them through the LLM.
  • keras_hub/src/models/qwen2_vl/qwen2_vl_projector.py
    • Added Qwen2VLProjector class to downsample vision features and project them into the LLM's hidden size.
    • Includes logic for merging 2x2 neighboring patches into a single token.
  • keras_hub/src/models/qwen2_vl/qwen2_vl_vision_encoder.py
    • Added Qwen2VLVisionEncoder class implementing a Vision Transformer (ViT) structure.
    • Includes a 3D convolution layer for handling video and images, transformer blocks, and a patch merger.
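The "merging 2x2 neighboring patches into a single token" logic in the projector can be sketched as a reshape that concatenates each 2x2 neighborhood's features before the MLP projection (an illustration under assumed row-major patch ordering, not the PR's exact code):

```python
import numpy as np

def merge_patches_2x2(features, grid_h, grid_w):
    # features: (grid_h * grid_w, dim) patch features in row-major order.
    # Group each 2x2 spatial neighborhood into one token by concatenating
    # the four patch vectors along the feature axis, quartering the
    # sequence length before projecting into the LLM's hidden size.
    dim = features.shape[-1]
    x = features.reshape(grid_h // 2, 2, grid_w // 2, 2, dim)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/2, W/2, 2, 2, dim)
    return x.reshape(-1, 4 * dim)   # (H*W/4, 4*dim)
```

The merged `4*dim` vectors would then pass through the projector MLP to reach the text embedding width.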
Activity
  • Initial implementation of Qwen2.5-VL architecture.
  • Definition of vision encoder, projector, and backbone components.
  • Modification of preprocessor to handle None images.

@gemini-code-assist (Contributor) left a comment

Code Review

This pull request introduces the initial skeleton for the Qwen2.5-VL model, including the backbone, vision encoder, and projector components, and refactors the Gemma3 preprocessor. While a good start, several areas require attention to align with repository standards. Specifically, new backbones should utilize the Keras Functional API, docstrings and get_config methods are missing, and there are implementation issues in Qwen2VLBackbone and Qwen2VLVisionBlock. Additionally, a minor code duplication issue was found in gemma3_causal_lm_preprocessor.py.

combined_embeddings = keras.ops.concatenate([image_embeddings, text_embeddings], axis=1)

# Pass through the LLM
x = self.text_backbone.transformer_layers(combined_embeddings)
critical

self.text_backbone.transformer_layers is a list of layers, not a single callable layer. This will raise an error. You need to iterate through the layers in a loop.

Additionally, this approach of accessing internal layers of self.text_backbone breaks encapsulation and is brittle. It would be better to either reuse the text_backbone's call method or restructure the model. The padding_mask input is also missing from the call to the transformer layers.
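A minimal sketch of the fix the reviewer describes (the class and layer here are illustrative stand-ins, not the actual KerasHub API): since `transformer_layers` is a Python list, each layer must be applied in turn rather than calling the list itself.

```python
import keras
import numpy as np

# Illustrative stand-in for a text backbone whose `transformer_layers`
# attribute is a Python list of layers, as in KerasHub backbones.
class TinyTextBackbone:
    def __init__(self, num_layers=2, dim=8):
        self.transformer_layers = [
            keras.layers.Dense(dim, activation="relu")
            for _ in range(num_layers)
        ]

def run_decoder(backbone, combined_embeddings):
    # A list is not callable: iterate over the layers instead of
    # writing `backbone.transformer_layers(combined_embeddings)`.
    x = combined_embeddings
    for layer in backbone.transformer_layers:
        x = layer(x)  # real decoder blocks would also take padding_mask
    return x
```

As the reviewer notes, reusing the text backbone's own call method avoids this kind of brittle reach into internal layers entirely.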

# We will squeeze these back at the end.
batched = True

batched = True

medium

This batched = True assignment is a duplicate of the one on line 681 and can be removed.

@@ -0,0 +1,41 @@
import keras
from keras_hub.src.models.backbone import Backbone
from keras_hub.src.models.qwen.qwen_backbone import QwenBackbone

medium

There is a trailing whitespace at the end of this line.

Suggested change
from keras_hub.src.models.qwen.qwen_backbone import QwenBackbone
from keras_hub.src.models.qwen.qwen_backbone import QwenBackbone


def call(self, x, grid_thw=None):
# x shape: (Batch, Time, Height, Width, Channels)
x = self.patch_embed(x)

medium

There is a trailing whitespace at the end of this line.

Suggested change
x = self.patch_embed(x)
x = self.patch_embed(x)

@jaytiwarihub jaytiwarihub marked this pull request as ready for review February 7, 2026 17:48
@jaytiwarihub (Author) commented:

Note to maintainers: Please update the PR title to [Model] Add Qwen2-VL Model Architecture and Preprocessing (I cannot see the edit button on my end)

@sachinprasadhs sachinprasadhs added the new model For PRs that contribute a new model to the Keras Hub registry. label Feb 9, 2026
@jaytiwarihub jaytiwarihub changed the title from [WIP] Port Qwen2.5-VL Model to Port Qwen2.5-VL Model Feb 13, 2026
This was referenced Feb 16, 2026
Comment on lines +15 to +19
from keras_hub.src.models.qwen2_vl.qwen2_vl_causal_lm import Qwen2VLCausalLM
from keras_hub.src.models.qwen2_vl.qwen2_vl_projector import Qwen2VLProjector
from keras_hub.src.models.qwen2_vl.qwen2_vl_vision_encoder import (
Qwen2VLVisionEncoder,
)
@samudraneel05 commented Feb 17, 2026

Suggested change
from keras_hub.src.models.qwen2_vl.qwen2_vl_causal_lm import Qwen2VLCausalLM
from keras_hub.src.models.qwen2_vl.qwen2_vl_projector import Qwen2VLProjector
from keras_hub.src.models.qwen2_vl.qwen2_vl_vision_encoder import (
Qwen2VLVisionEncoder,
)
from keras_hub.src.models.qwen2_vl.qwen2_vl_presets import backbone_presets
from keras_hub.src.utils.preset_utils import register_presets
register_presets(backbone_presets, Qwen2VLBackbone)

The other imports inside the `__init__` are unnecessary and go against repo standards.

"""Qwen2-VL Backbone model.

This backbone combines the Vision Encoder and the Text Backbone.
It follows the KerasHub Functional API pattern.


This is an odd line in the docstring; better to remove it and instead document the parameters or args.

Comment on lines +1 to +13
# Copyright 2024 The KerasHub Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


Suggested change
# Copyright 2024 The KerasHub Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Why is this here when it does not exist in any other model's backbone file?

Comment on lines +1 to +13
# Copyright 2024 The KerasHub Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


Suggested change
# Copyright 2024 The KerasHub Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

This is not present in any other model's `__init__` file; curious as to why you put it here?

@sachinprasadhs (Collaborator) commented:

@jaytiwarihub, as you have mentioned, the implementation is for Qwen2.5-VL, so could you please rename the directories accordingly?
Also, Transformers has separate directories and implementations for Qwen2.5-VL and Qwen2-VL.
So, @samudraneel05 can continue on Qwen2-VL, and @jaytiwarihub will work on Qwen2.5-VL.
We will also have a separate Qwen3-VL implementation.

All of this has been communicated in the issue threads below, and the respective contributors have been assigned to these issues as well.
#2172
#2323
#2570

Hope this clarifies all the confusion.
Thanks for showing interest in contributing the models.

@jaytiwarihub (Author) commented:

@samudraneel05 thanks for the help, I appreciate it. Those weird lines were just the result of me feeling overwhelmed; I'll take care of it.

@sachinprasadhs (Collaborator) left a comment

Thanks for the PR.
I just went through the PR high level, the code looks incomplete and also does not follow the Keras Hub design principles and guidelines.
Please refer to https://github.com/keras-team/keras-hub/blob/master/CONTRIBUTING_MODELS.md for details.

@jaytiwarihub (Author) commented:

@sachinprasadhs thank you for your kind review! I'm working on it.

Labels: new model (for PRs that contribute a new model to the Keras Hub registry.)